266 research outputs found

    Coupled Ensembles of Neural Networks

    Full text link
    We investigate in this paper the architecture of deep convolutional networks. Building on existing state-of-the-art models, we propose a reconfiguration of the model parameters into several parallel branches at the global network level, with each branch being a standalone CNN. We show that this arrangement is an efficient way either to significantly reduce the number of parameters without losing performance, or to significantly improve the performance for the same parameter budget. The use of branches brings an additional form of regularization. In addition to the split into parallel branches, we propose a tighter coupling of these branches by placing the "fuse (averaging) layer" before the Log-Likelihood and SoftMax layers during training. This gives another significant performance improvement, the tighter coupling favouring the learning of better representations, even at the level of the individual branches. We refer to this branched architecture as "coupled ensembles". The approach is very generic and can be applied with almost any DCNN architecture. With coupled ensembles of DenseNet-BC networks and a parameter budget of 25M, we obtain error rates of 2.92%, 15.68% and 1.50% respectively on the CIFAR-10, CIFAR-100 and SVHN tasks. For the same budget, DenseNet-BC has error rates of 3.46%, 17.18% and 1.8% respectively. With ensembles of coupled ensembles of DenseNet-BC networks, with 50M total parameters, we obtain error rates of 2.72%, 15.13% and 1.42% respectively on these tasks.
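
    The coupling point matters: the branches are fused by averaging before the Log-Likelihood/SoftMax stage, so every branch receives gradients through a single fused output. Below is a minimal PyTorch sketch of that idea; the module names, the `make_branch` factory and the choice to average log-probabilities are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a "coupled ensemble" head, assuming PyTorch.
# Each branch is a standalone CNN; the fuse (averaging) layer sits
# before the final loss, so all branches are trained jointly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoupledEnsemble(nn.Module):
    def __init__(self, make_branch, n_branches: int):
        super().__init__()
        # make_branch() returns any CNN mapping images -> class logits
        self.branches = nn.ModuleList(make_branch() for _ in range(n_branches))

    def forward(self, x):
        # Per-branch log-probabilities: (n_branches, batch, n_classes)
        logps = torch.stack([F.log_softmax(b(x), dim=1) for b in self.branches])
        # Fuse (averaging) layer BEFORE the loss; averaging raw logits
        # instead of log-probabilities is another possible variant.
        return logps.mean(dim=0)

# Training then uses F.nll_loss(model(x), targets), so gradients flow
# through the fused output into every branch simultaneously.
```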

    LIG and LIRIS at TRECVID 2008: High Level Feature Extraction and Collaborative Annotation

    Get PDF
    This paper describes the participation of LIG and LIRIS in the TRECVID 2008 High Level Features detection task. We evaluated several fusion strategies, especially rank fusion. Results show that including as many low-level and intermediate features as possible is the best strategy, that SIFT features are very important, that the way in which the fusion of the various low-level and intermediate features is performed matters, and that the type of mean (arithmetic, geometric or harmonic) matters as well. The best LIG and LIRIS runs have a Mean Inferred Average Precision of 0.0833 and 0.0598 respectively, both above the TRECVID 2008 HLF detection task median performance. LIG and LIRIS also co-organized the TRECVID 2008 collaborative annotation: 40 teams made 1,235,428 annotations. The development collection was annotated at least once at 100%, at least twice at 37.6%, at least three times at 3.99% and at least four times at 0.06%. Thanks to the active learning and active cleaning approach used, the annotations that were done multiple times were those for which the risk of error was highest.
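
    To make the comparison of means concrete, here is a minimal sketch of fusing per-feature detection scores with the three means mentioned above; the array layout (features × shots) and the score range are assumptions.

```python
# Sketch: fusing per-feature concept scores with different means, assuming
# `scores` has shape (n_features, n_shots) with values in (0, 1].
import numpy as np

def fuse(scores: np.ndarray, kind: str = "arithmetic") -> np.ndarray:
    if kind == "arithmetic":
        return scores.mean(axis=0)
    if kind == "geometric":
        # exp(mean(log(scores))): dominated by low scores, rewards consensus.
        return np.exp(np.log(scores).mean(axis=0))
    if kind == "harmonic":
        return scores.shape[0] / (1.0 / scores).sum(axis=0)
    raise ValueError(kind)

scores = np.random.uniform(0.01, 1.0, size=(5, 8))  # 5 features, 8 shots
fused = fuse(scores, "geometric")
```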

    Joint audio-visual words for violent scene detection in videos

    Get PDF
    This paper presents an audio-visual data representation for violent scene detection in movies. Existing work in this area considers visual information or audio information, or at most their classical fusion. Until now, few approaches have explored their mutual dependence for violent scene detection. We therefore propose a descriptor that provides joint audio and visual multimodal cues: first by assembling the audio and visual descriptors, then by statistically revealing the joint multimodal patterns. Experimental validation was carried out within the "Violent Scenes Detection" task of MediaEval 2013. The results show the potential of the proposed approach in comparison with methods using the audio and visual descriptors separately, or with other types of fusion.
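
    As a rough illustration of the two steps described above (assembling the descriptors, then revealing joint patterns), the sketch below builds a joint audio-visual codebook with k-means; the descriptor shapes, the clustering choice and the histogram representation are assumptions, not the paper's method.

```python
# Sketch of "joint audio-visual words", assuming aligned per-segment
# audio and visual descriptors as NumPy arrays with equal row counts.
import numpy as np
from sklearn.cluster import KMeans

def joint_av_words(audio: np.ndarray, visual: np.ndarray, n_words: int = 64):
    # 1) Assemble the two modalities into one joint descriptor per segment.
    joint = np.hstack([audio, visual])          # (n_segments, d_a + d_v)
    # 2) Reveal joint multimodal patterns statistically, here via k-means:
    #    each cluster centre acts as one joint audio-visual "word".
    #    (Assumes at least n_words segments are available.)
    km = KMeans(n_clusters=n_words, n_init=10).fit(joint)
    # A clip is then described by its histogram of joint words.
    hist = np.bincount(km.labels_, minlength=n_words).astype(float)
    return hist / hist.sum(), km
```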

    A factorized model for multiple SVM and multi-label classification for large scale multimedia indexing

    No full text
    This paper presents a set of improvements for SVM-based large-scale multimedia indexing. The proposed method is particularly suited to the detection of many target concepts at once and to highly imbalanced classes (very infrequent concepts). The method is based on the use of multiple SVMs (MSVM) for dealing with the class imbalance, and on adaptations of this approach that allow for an efficient implementation using optimized linear algebra routines. The implementation also involves hashed structures allowing the factorization of computations between the multiple SVMs and the multiple target concepts, and is denoted Factorized-MSVM. Experiments were conducted on a large-scale dataset, namely the TRECVid 2012 semantic indexing task. Results show that Factorized-MSVM performs as well as the original MSVM, but is significantly faster. Speed-ups by factors of several hundred were obtained for the simultaneous classification of 346 concepts, compared to the original MSVM implementation based on the popular libSVM library.
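
    The factorization can be illustrated with a small sketch: once the weight vectors of all SVMs for all concepts are stacked into one matrix, scoring every sample against every classifier becomes a single matrix product handled by optimized linear algebra routines. The shapes and the final averaging step are assumptions, not the paper's implementation.

```python
# Sketch of the factorization idea: one GEMM scores many linear SVMs
# for many concepts at once, so the shared work is done by BLAS.
import numpy as np

n_samples, dim = 10_000, 1_024
n_concepts, n_svms = 346, 5            # several SVMs per concept (MSVM)

X = np.random.randn(n_samples, dim).astype(np.float32)
# Stack every SVM of every concept into a single weight matrix.
W = np.random.randn(dim, n_concepts * n_svms).astype(np.float32)
b = np.random.randn(n_concepts * n_svms).astype(np.float32)

# One matrix product computes all decision values at once ...
scores = X @ W + b                      # (n_samples, n_concepts * n_svms)
# ... then the multiple SVMs of each concept are combined, e.g. averaged.
scores = scores.reshape(n_samples, n_concepts, n_svms).mean(axis=2)
```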

    A linguistic-pattern approach to automatic speaker detection: application to content-based indexing of television news

    No full text
    The identity of people in audiovisual documents is an important piece of semantic information for a content-based indexing and retrieval process. The speaker identity detection task can be carried out by exploiting pieces of information from different modalities (text, image and sound). In this article, we propose an approach for indexing speaker identity in television news by exploiting the audio content. After a speaker segmentation phase, an identity is assigned to speech segments through linguistic patterns applied to their transcripts produced by speech recognition. Three types of patterns are used to predict the speaker identity in the previous, current or following segments. These predictions are then propagated to other segments by acoustic-level similarity. Evaluations were conducted on part of the TREC 2003 corpus: a speaker identity could be assigned to 53% of the annotated corpus with a precision of 82%.
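
    A minimal sketch of the three pattern types is given below; the regular expressions, the English phrasings and the name format are purely illustrative assumptions (the paper works on French broadcast news transcripts).

```python
# Sketch: linguistic patterns over an ASR transcript predicting the
# speaker of the previous, current or next segment.
import re

PATTERNS = {
    # The named person speaks in the NEXT segment.
    "next":     re.compile(r"[Oo]ver to you,? (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
    # The named person is speaking in the CURRENT segment.
    "current":  re.compile(r"I am (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
    # The named person spoke in the PREVIOUS segment.
    "previous": re.compile(r"[Tt]hank you,? (?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"),
}

def predict_speakers(transcript: str):
    return [(kind, m.group("name"))
            for kind, pat in PATTERNS.items()
            for m in pat.finditer(transcript)]

print(predict_speakers("Thank you, John Smith. Over to you, Jane Doe."))
# [('next', 'Jane Doe'), ('previous', 'John Smith')]
```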

    Descriptor Optimization for Multimedia Indexing and Retrieval

    No full text
    In this paper, we propose and evaluate a method for optimizing descriptors used for content-based multimedia indexing and retrieval. A large variety of descriptors are commonly used for this purpose. However, the most efficient ones often have characteristics preventing them from being easily used in large-scale systems. They may have very high dimensionality (up to tens of thousands of dimensions) and/or be suited to a distance that is costly to compute (e.g. chi-square). The proposed method combines a PCA-based dimensionality reduction with pre- and post-PCA non-linear transformations. The resulting transformation is globally optimized. The produced descriptors have a much lower dimensionality while performing at least as well, and often significantly better, with the Euclidean distance than the original high-dimensionality descriptors with their optimal distance. The method was validated and evaluated for a variety of descriptors using TRECVid 2010 semantic indexing task data. It was then applied at large scale for the TRECVid 2012 semantic indexing task on tens of descriptors of various types, with initial dimensionalities ranging from 15 up to 32,768. The same transformation can also be used for multimedia retrieval in the context of query by example and/or relevance feedback.
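
    A hedged sketch of such a pipeline is shown below: a pre-PCA non-linearity (here a fixed power normalization), a PCA projection, and a post-PCA transformation (here L2 normalization), after which plain Euclidean distance is used. In the paper the whole transformation is globally optimized; the fixed exponent and dimensions here are assumptions.

```python
# Sketch: pre-PCA non-linear map -> PCA reduction -> post-PCA map,
# producing compact, Euclidean-comparable descriptors.
import numpy as np
from sklearn.decomposition import PCA

def transform(D: np.ndarray, out_dim: int = 256, alpha: float = 0.5):
    D = np.sign(D) * np.abs(D) ** alpha            # pre-PCA non-linearity
    D = PCA(n_components=out_dim).fit_transform(D) # dimensionality reduction
    D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-12  # post-PCA map
    return D

high_dim = np.random.rand(1000, 32768).astype(np.float32)  # e.g. a BoW descriptor
compact = transform(high_dim)   # (1000, 256), compared with Euclidean distance
```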

    Semantic Video Content Indexing and Retrieval using Conceptual Graphs

    No full text
    In this article, we propose a conceptual model for video content description. This model is an extension of the EMIR² model proposed for image representation and retrieval. The proposed extensions include the addition of views such as the temporal and event views that are specific to video documents, the extension of the structural view to the temporal structure of video documents, and the extension of the perceptive view to motion descriptors. We have kept the formalism of conceptual graphs for the representation of the semantic content. The various concepts and relations involved can be taken from general and/or domain-specific ontologies and completed by lists of instances (individuals). The proposed model has been applied to the TREC video 2002 and 2003 corpora, which mainly contain TV news and commercial videos.
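
    As a loose illustration only, a conceptual-graph description can be encoded as typed relations between concept nodes, with retrieval as subgraph containment; the encoding below is an assumed simplification, not the EMIR² formalism itself, and the concepts and relation names are made up.

```python
# Sketch: a conceptual-graph-like description of a shot as typed
# relations between concepts, with query matching as containment.
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    subject: str   # concept or instance, e.g. "Person:anchor"
    relation: str  # e.g. "agentOf", "locatedIn", "before"
    obj: str

shot_description = [
    Relation("Person:anchor", "agentOf", "Event:report"),
    Relation("Event:report", "locatedIn", "Place:studio"),
    Relation("Event:report", "before", "Event:commercial"),  # temporal view
]

def matches(graph, query):
    # A query graph matches if all its relations appear in the document graph.
    return all(q in graph for q in query)

query = [Relation("Event:report", "locatedIn", "Place:studio")]
print(matches(shot_description, query))  # True
```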

    Video annotation with rare concept pairs

    No full text
    Detecting a visual concept in videos is a difficult task, especially for rare concepts or for those that are hard to describe visually. The question becomes even harder when we want to detect a pair of concepts instead of a single one. Indeed, the more concepts are present in a video scene, the more visually complex the scene is, and the harder it becomes to find a specific description for it. Two main directions can be followed to tackle this problem: 1) detect each concept separately and then combine the predictions of the corresponding detectors, in a way similar to what is often done in information retrieval, or 2) consider the pair as a new concept and train a supervised classifier for this new concept, inferring new annotations from those of the two concepts forming the pair. Each of these approaches has its advantages and drawbacks. The major problem of the second method is the need for an annotated data set, especially for the positive class. If the concepts are rare, this rarity grows even more for the pairs formed from their combinations. On the other hand, two concepts may each be fairly frequent while very rarely occurring together in the same document. Some state-of-the-art works have proposed to overcome this problem by collecting representative examples of the studied classes from the web, but this task remains costly in time and money. We compared the two types of approaches without resorting to external resources. Our evaluation was carried out within the "concept pair detection" subtask of the semantic indexing (SIN) task of TRECVID 2013, and the results showed that for videos, when no external information resources are used, the approaches that fuse the results of the two detectors perform better, contrary to what had been shown in previous work for the case of still images. The performance of the described methods exceeds the best official result of the aforementioned evaluation campaign by 9% in terms of relative gain in mean average precision (MAP).
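
    The first direction (fusing the outputs of the two single-concept detectors) can be sketched as follows; the combination rules and the example concepts are assumptions, shown only to make the fusion idea concrete.

```python
# Sketch: combining per-concept detector scores into a pair score,
# IR-style, for ranking video shots by presence of both concepts.
import numpy as np

def pair_score(s_a: np.ndarray, s_b: np.ndarray, how: str = "product"):
    if how == "product":        # probabilistic AND under independence
        return s_a * s_b
    if how == "min":            # weakest-link combination
        return np.minimum(s_a, s_b)
    if how == "harmonic":       # penalizes imbalance between the two scores
        return 2 * s_a * s_b / (s_a + s_b + 1e-12)
    raise ValueError(how)

s_person, s_boat = np.random.rand(100), np.random.rand(100)  # hypothetical pair
ranking = np.argsort(-pair_score(s_person, s_boat, "product"))
```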

    Temporal re-scoring vs. temporal descriptors for semantic indexing of videos

    No full text
    The automated indexing of images and videos is a difficult problem because of the "distance" between the arrays of numbers encoding these documents and the concepts (e.g. people, places, events or objects) with which we wish to annotate them. Methods exist for this, but their results are far from satisfactory in terms of generality and accuracy. Existing methods typically learn a concept from a single set of annotated examples and consider that set as uniform. This is not optimal because the same concept may appear in various contexts, and its appearance may be very different depending upon these contexts. Context has been widely used in the state of the art to address various problems. However, the temporal context seems to be the most crucial and the most effective in the case of videos. In this paper, we present a comparative study between two methods exploiting the temporal context for semantic video indexing. The proposed approaches use temporal information derived from two different sources: low-level content and semantic information. Our experiments on the TRECVID 2012 collection showed interesting results that confirm the usefulness of the temporal context and demonstrate which of the two approaches is more effective.
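
    As an illustration of the temporal re-scoring idea, the sketch below re-estimates each shot's concept score from its temporal neighbours, under the assumption that a concept present in one shot is likely present in adjacent ones; the kernel shape and window size are assumptions.

```python
# Sketch: temporal re-scoring of per-shot concept scores by weighted
# averaging over a symmetric window of neighbouring shots.
import numpy as np

def temporal_rescore(scores: np.ndarray, window: int = 2) -> np.ndarray:
    # scores: (n_shots,) initial detector outputs, in temporal order.
    weights = np.array([0.5 ** abs(k) for k in range(-window, window + 1)])
    weights /= weights.sum()
    padded = np.pad(scores, window, mode="edge")
    # Slide the normalized kernel over the padded score sequence.
    return np.correlate(padded, weights, mode="valid")

shot_scores = np.random.rand(20)
smoothed = temporal_rescore(shot_scores)   # same length, temporally smoothed
```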
    • 

    corecore